CSDA Lab, Mathematics and Statistics Department, University of West Florida
High-dimensional data refers to data with a large number of features (covariates) \(p\). Formally, we write the data as \(\mathbf{X} \in \mathbb{R} ^{n\times p}\) with:
\[ p \gg n, \tag{1} \] where \(n\) is the number of observations.
In this context, many challenges arise:
Some solutions in the literature:
Our data are mass spectrometry signals (functional data).
The Fourier Transform of a signal \(x(t)\) can be expressed as:
\[ X(f)= \int_{-\infty}^{\infty} x(t) e^{-i2 \pi ft}\, dt \tag{2} \](\(e^{ix}= \cos x + i \sin x\), Euler’s formula); \(f\) denotes frequency.
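In practice, the transform in Eq. (2) is computed on sampled signals via the FFT. A minimal NumPy sketch (the signal, sampling rate, and frequency are illustrative, not from our data):

```python
import numpy as np

fs = 100.0                          # sampling rate (Hz), illustrative
t = np.arange(0, 10, 1 / fs)        # 10 s of samples
x = np.cos(2 * np.pi * 5 * t)       # a 5 Hz tone

X = np.fft.rfft(x)                  # discrete analogue of Eq. (2)
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

peak = freqs[np.argmax(np.abs(X))]  # dominant frequency of the signal
```

The spectrum peaks at the tone's frequency, 5 Hz.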
The Wavelet Transform of a signal \(x(t)\) can be given as:
\[ WT(s,\tau)= \frac{1}{\sqrt s}\int_{-\infty}^{\infty} x(t) \psi^*\big(\frac{t-\tau}{s}\big) dt, \tag{3} \]
where \(\psi^*(t)\) denotes the complex conjugate of the base wavelet \(\psi(t)\); \(s\) is the scaling parameter, and \(\tau\) is the location parameter.
Example: Morlet Wavelet \(\psi(t) = e^{i2 \pi f_0t} e^{-(\alpha t^2/\beta^2)}\), with the parameters \(f_0\), \(\alpha\), \(\beta\) all being constants.
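Equation (3) with the Morlet wavelet can be discretized directly. The sketch below (grids and parameter values are illustrative) approximates the integral with a Riemann sum; the scale whose center frequency \(f_0/s\) matches the signal captures the most energy:

```python
import numpy as np

def morlet(t, f0=5.0, alpha=1.0, beta=1.0):
    # psi(t) = exp(i 2 pi f0 t) * exp(-alpha t^2 / beta^2)
    return np.exp(1j * 2 * np.pi * f0 * t) * np.exp(-alpha * t**2 / beta**2)

def cwt(x, t, scales, taus, f0=5.0):
    """Riemann-sum approximation of Eq. (3)."""
    dt = t[1] - t[0]
    W = np.empty((len(scales), len(taus)), dtype=complex)
    for i, s in enumerate(scales):
        for j, tau in enumerate(taus):
            W[i, j] = (1 / np.sqrt(s)) * np.sum(
                x * np.conj(morlet((t - tau) / s, f0))) * dt
    return W

t = np.arange(-5, 5, 0.01)
x = np.cos(2 * np.pi * 5 * t)            # 5 Hz signal
scales = np.array([0.5, 1.0, 2.0])       # s = 1.0 gives f0/s = 5 Hz
taus = np.linspace(-2, 2, 9)
W = cwt(x, t, scales, taus)
energy = np.sum(np.abs(W) ** 2, axis=1)  # energy per scale
```

The middle scale (\(s = 1\)) dominates, since its center frequency matches the signal's 5 Hz.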
Ovarian cancer detection (Yu et al. 2005): A combination of Binning, Kolmogorov-Smirnov test, discrete wavelet transform, and support vector machines.
Proteomic profile with bi-orthogonal discrete wavelet transform (Schleif et al. 2009): A combination of centroid-based outlier detection, recalibration, baseline correction (top-hat filter), Kolmogorov-Smirnov test, discrete wavelet transform bior 3.7, and support vector machines.
Ovarian cancer detection using peaks and discrete wavelet transform (Du et al. 2009): A combination of discrete wavelet transform, thresholding, peak detection using MAD, Kolmogorov-Smirnov test, bagging predictor.
Ovarian cancer classification using wavelets and genetic algorithm (Nguyen et al. 2015): A combination of Haar discrete wavelet transform, genetic algorithms.
Breast cancer mass spectrum classification (Cohen, Messaoudi, and Badir 2018): A combination of segmentation, discrete wavelet transform, statistical features on the coefficients, PCA-T2 Hotelling statistic, SVM.
Ovarian cancer mass spectrum classification (Vimalajeewa, Bruce, and Vidakovic 2023): A combination of Daubechies-7 wavelet transform, sample variance and distance variance, Fisher’s criterion for feature extraction, SVM, KNN, and Logistic regression.
A workflow for ML is the following:
Data Collection
Data Processing: Clean, Explore, Prepare, Transform
Modeling: Develop, Train, Validate, and Evaluate
Deployment: Deploy, Monitor and Update
Return to step 1 and iterate.
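The Modeling step of this workflow can be sketched with scikit-learn, using synthetic data as a stand-in for processed spectra and one of the four models considered later (logistic regression); all names and sizes here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: synthetic stand-in for collected, processed data
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Step 3: develop, train, validate, and evaluate
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
acc = accuracy_score(y_te, model.predict(X_te))
```

Deployment and monitoring (step 4) would then wrap this fitted pipeline.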
We designed a statistical experiment to evaluate four different processing approaches.
Variables of the experimental design:
Four pre-processing techniques: two wavelet-based and two not.
Five window sizes.
Ten wavelet families.
Four ML Models: Logistic Regression, Support Vector Machine, Random Forest, and XGBoost.
Two sampling strategies: upsampling and no sampling, to address the class imbalance.
A total of 8800 models were run.
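The design can be enumerated with `itertools`; wavelet families apply only to the two wavelet-based techniques. The sketch below yields 880 base configurations, so the reported 8,800 presumably reflects an additional factor of 10 (e.g., repeated runs or folds; this is an assumption, not stated in the design):

```python
from itertools import product

procs_wavelet = ["PROC1", "PROC2"]       # wavelet-based techniques
procs_plain = ["PROC3", "PROC4"]         # non-wavelet techniques
windows = range(5)                       # 5 window sizes
wavelets = range(10)                     # 10 wavelet families
models = ["LR", "SVM", "RF", "XGBoost"]  # 4 ML models
sampling = ["up", "none"]                # 2 sampling strategies

configs = (
    list(product(procs_wavelet, wavelets, windows, models, sampling))
    + list(product(procs_plain, windows, models, sampling))
)
n = len(configs)  # 2*10*5*4*2 + 2*5*4*2 = 880
```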
Processing 1 (PROC1): Wavelet transform; the feature space includes the mean, variance, energy, coefficient of variation, skewness, and kurtosis.
Processing 2 (PROC2): Same as PROC1, but the feature space also includes the first 10 autocorrelation coefficients.
Processing 3 (PROC3): Same as PROC1 but without the wavelet transform.
Processing 4 (PROC4): Same as PROC2 but without the wavelet transform.
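The feature sets above can be sketched as plain NumPy functions using the standard moment formulas (function names are illustrative; the coefficient of variation assumes a nonzero mean):

```python
import numpy as np

def summary_features(x):
    """Mean, variance, energy, CV, skewness, kurtosis (PROC1/PROC3 features)."""
    x = np.asarray(x, dtype=float)
    m, v = x.mean(), x.var()
    s = np.sqrt(v)
    return np.array([
        m,
        v,
        np.sum(x ** 2),                  # energy
        s / m,                           # coefficient of variation
        np.mean((x - m) ** 3) / s ** 3,  # skewness
        np.mean((x - m) ** 4) / v ** 2,  # kurtosis
    ])

def autocorr_features(x, lags=10):
    """First `lags` autocorrelation coefficients (added in PROC2/PROC4)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, lags + 1)])
```

For the wavelet-based techniques these functions would be applied to the wavelet coefficients; for PROC3/PROC4, to the raw windowed signal.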
Consider a normalized data matrix, with \(p\) variables and \(N\) observations:
\[ \textbf{Z}= \left(\begin{array}{cccc}z_{11} & z_{12} & \dots &z_{1p}\\ z_{21} & z_{22} & \dots &z_{2p}\\ \vdots & \vdots & \ddots & \vdots \\ z_{N1} & z_{N2} & \dots & z_{Np}\end{array}\right) \]
The covariance matrix of \(Z\) can be approximated as:
\[ S= \frac{1}{N-1}Z^{T}Z= P \Lambda P^{T} \]
where \(\Lambda= diag (\lambda_1, \lambda_2, \dots, \lambda_p)\) with \(\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p\). The \(\lambda_i\) are the eigenvalues, and the columns of \(P\) are the eigenvectors of \(S\).
According to the \(\lambda_i\)’s, \(P\) and \(\Lambda\) can be divided into a feature space \((P_{feat}, \Lambda_{feat})\) and a residual space \((P_{res}, \Lambda_{res})\). We can then rewrite \(P\) and \(\Lambda\) as follows:
\[ P= \left[\begin{array}{cc} P_{feat} & P_{res}\end{array}\right] \]
\[ \Lambda= \left[\begin{array}{cc}\Lambda_{feat} & 0 \\ 0 & \Lambda_{res}\end{array}\right] \]
The Hotelling \(T^2\) statistic can then be computed as follows:
\[ {T^2}= z P_{feat} \Lambda^{-1}_{feat} P^T_{feat} z^{T} \tag{4} \]
where \(z\) is an observation (a row of \(Z\)), \(T^2\) is the Hotelling statistic computed in the multivariate feature space of the principal component analysis, and \(P^T_{feat}\) is the transpose of \(P_{feat}\).
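Equation (4) can be computed per observation from the eigendecomposition of \(S\). A NumPy sketch (the toy data and the number of retained components \(k\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))                      # toy data: N=50, p=6
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # normalized data matrix

S = Z.T @ Z / (len(Z) - 1)       # covariance, S = P Lambda P^T
lam, P = np.linalg.eigh(S)       # eigenvalues come out ascending
order = np.argsort(lam)[::-1]    # reorder so lambda_1 >= ... >= lambda_p
lam, P = lam[order], P[:, order]

k = 3                            # components kept in the feature space
P_feat, lam_feat = P[:, :k], lam[:k]

scores = Z @ P_feat                          # principal component scores
T2 = np.sum(scores ** 2 / lam_feat, axis=1)  # Eq. (4), one value per row
```

Each entry of `T2` measures how far that observation sits from the center of the retained PCA feature space.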
The performance metrics utilized were:
Observed 32,768 m/z values / 33,885 m/z values
Link: https://bioinformatics.mdanderson.org/public-datasets/
PROC3 seems to have the highest values across all models and metrics.
The larger the sample size, the higher the average accuracy (not controlling for the other variables).
The wavelet coefficients do not always lead to improved performance.
Additionally, autocorrelations do not appear to be effective predictors; this is anticipated for wavelet coefficients, which are approximately decorrelated.
Future work
Joint Mathematics Meetings | Jan 8-11, 2025 | Seattle